Homework 4:

  1. Follow the steps below:
    • Read wine.csv in the data folder.
    • The first column contains the wine category. Don't use it in the models below; we treat this as an unsupervised learning problem and then compare the results to the Wine column.
  2. Try KMeans with n_clusters = 3 and compare the clusters to the Wine column.
  3. Try PCA and see how much you can reduce the variable space.
    • How many components did you need to explain 99% of the variance in this dataset?
    • Plot the PCA variables to see if they bring out the clusters.
  4. Try KMeans and hierarchical clustering on the PCA-transformed data and compare the clusters to the Wine column again.

Dataset

wine.csv is in the data folder under homeworks.


In [105]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix
%matplotlib inline
np.set_printoptions(suppress=True)

In [106]:
wine = pd.read_csv('../data/wine.csv')

In [107]:
wine.tail()


Out[107]:
Wine Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
173 3 13.71 5.65 2.45 20.5 95 1.68 0.61 0.52 1.06 7.7 0.64 1.74 740
174 3 13.40 3.91 2.48 23.0 102 1.80 0.75 0.43 1.41 7.3 0.70 1.56 750
175 3 13.27 4.28 2.26 20.0 120 1.59 0.69 0.43 1.35 10.2 0.59 1.56 835
176 3 13.17 2.59 2.37 20.0 120 1.65 0.68 0.53 1.46 9.3 0.60 1.62 840
177 3 14.13 4.10 2.74 24.5 96 2.05 0.76 0.56 1.35 9.2 0.61 1.60 560
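
Before clustering, it helps to know how the 178 samples split across the three categories. A quick check (not part of the original run):

In [ ]:
# number of samples in each wine category
print(wine.Wine.value_counts())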

In [108]:
# shift the Wine labels from 1-3 to 0-2 so they are zero-based like KMeans labels
wine.Wine = wine.Wine - 1

In [109]:
y = wine.Wine

In [110]:
# all 13 measurements; the Wine column itself stays out of the models
X = wine.iloc[:, 1:]

In [111]:
kmeans = KMeans(n_clusters=3, random_state=1)
Y_hat_kmeans = kmeans.fit(X).labels_

In [112]:
# Alcohol vs. Malic.acid, colored by cluster, with point size scaled by Mg
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=Y_hat_kmeans, s=X.iloc[:, 4] * 2)


Out[112]:
<matplotlib.collections.PathCollection at 0x110417e50>

In [113]:
print(confusion_matrix(Y_hat_kmeans, y))
plt.matshow(confusion_matrix(Y_hat_kmeans, y))
plt.title('confusion matrix')
plt.xlabel('actual values')    # columns: the Wine column
plt.ylabel('Y_hat_kmeans')     # rows: KMeans cluster labels
plt.colorbar()


[[46  1  0]
 [ 0 50 19]
 [13 20 29]]
Out[113]:
<matplotlib.colorbar.Colorbar instance at 0x11052bb00>
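
KMeans numbers its clusters arbitrarily, so the rows of this confusion matrix can be permuted relative to the true labels. A permutation-invariant score avoids that; a minimal sketch using sklearn's adjusted_rand_score:

In [ ]:
from sklearn.metrics import adjusted_rand_score

# agreement between cluster labels and the Wine column,
# invariant to how the cluster numbers happen to be assigned
print(adjusted_rand_score(y, Y_hat_kmeans))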

In [114]:
from sklearn.decomposition import PCA
from sklearn import preprocessing

In [115]:
# standardize the features, then sweep the number of components (1-13)
# and record the total explained variance for each choice
X_scale = preprocessing.scale(X)
comp = np.arange(1, 14)
explained_var = []
for i in comp:
    pca = PCA(n_components=i)
    X_pca = pca.fit_transform(X_scale)
    explained_var.append(pca.explained_variance_ratio_.sum())
plt.plot(comp, explained_var)


Out[115]:
[<matplotlib.lines.Line2D at 0x11073a1d0>]

In [116]:
pca.explained_variance_ratio_


Out[116]:
array([ 0.36198848,  0.1920749 ,  0.11123631,  0.0706903 ,  0.06563294,
        0.04935823,  0.04238679,  0.02680749,  0.02222153,  0.01930019,
        0.01736836,  0.01298233,  0.00795215])
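
Reading the 99% threshold off the plot is imprecise; a cumulative sum gives the count directly. From the ratios above, the first 12 of the 13 scaled components are needed. A quick sketch:

In [ ]:
# smallest number of components whose cumulative explained variance reaches 99%
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(np.argmax(cum_var >= 0.99) + 1)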

In [117]:
# repeat the sweep on the raw, unscaled data for comparison
comp = np.arange(13) + 1
explained_var = []
for i in comp:
    pca = PCA(n_components=i)
    X_pca = pca.fit_transform(X)
    explained_var.append(pca.explained_variance_ratio_.sum())
plt.plot(comp, explained_var)


Out[117]:
[<matplotlib.lines.Line2D at 0x1111c0e10>]

In [118]:
print(pca.explained_variance_ratio_)


[ 0.99809123  0.00173592  0.00009496  0.00005022  0.00001236  0.00000846
  0.00000281  0.00000152  0.00000113  0.00000072  0.00000038  0.00000021
  0.00000008]
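
On the unscaled data a single component already explains ~99.8% of the variance. That is a scale artifact rather than structure: Proline's numeric range is far larger than the other features', so it dominates the covariance. A quick check of the per-feature spread:

In [ ]:
# Proline's standard deviation dwarfs every other feature,
# which is why unscaled PCA collapses onto one component
print(X.std())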

In [119]:
# keep 4 components; on the unscaled data this covers well over 99% of the variance
pca = PCA(n_components=4)
X_pca = pca.fit_transform(X)

In [120]:
# refit KMeans (still n_clusters=3) on the PCA scores
Y_hat_kmeans = kmeans.fit(X_pca).labels_

In [121]:
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=Y_hat_kmeans)
plt.colorbar()


Out[121]:
<matplotlib.colorbar.Colorbar instance at 0x111366908>
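
To check whether the projection separates the actual categories rather than just the KMeans clusters, the same scatter can be colored by the Wine column. A quick sketch:

In [ ]:
# same projection, colored by the true Wine category
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.colorbar()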

In [122]:
# compute the pairwise Euclidean distance matrix
# (the kind of input hierarchical clustering works from)
from scipy.spatial.distance import pdist, squareform

distx = squareform(pdist(X_pca, metric='euclidean'))
distx


Out[122]:
array([[   0.        ,   31.21669984,  122.82027877, ...,  230.22891326,
         225.20472279,  506.05806074],
       [  31.21669984,    0.        ,  135.2135927 , ...,  216.21883165,
         211.20932631,  490.23283102],
       [ 122.82027877,  135.2135927 ,    0.        , ...,  350.56698345,
         345.55722715,  625.06802374],
       ..., 
       [ 230.22891326,  216.21883165,  350.56698345, ...,    0.        ,
           5.14098121,  276.07931837],
       [ 225.20472279,  211.20932631,  345.55722715, ...,    5.14098121,
           0.        ,  281.06097274],
       [ 506.05806074,  490.23283102,  625.06802374, ...,  276.07931837,
         281.06097274,    0.        ]])
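
The distance matrix sets up the hierarchical-clustering half of task 4, which the cells above never actually run. A minimal sketch using scipy's linkage and fcluster, assuming Ward linkage and a cut at three clusters to match the three wine categories:

In [ ]:
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# agglomerative clustering on the PCA-transformed data
Z = linkage(X_pca, method='ward')

# cut the tree at three clusters; shift to 0-2 to match the Wine labels
Y_hat_hier = fcluster(Z, t=3, criterion='maxclust') - 1

print(confusion_matrix(Y_hat_hier, y))
dendrogram(Z)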

In [123]:
kmeans = KMeans(n_clusters=3, random_state=1)
Y_hat_kmeans = kmeans.fit(X_pca).labels_

In [124]:
print(confusion_matrix(Y_hat_kmeans, y))
plt.matshow(confusion_matrix(Y_hat_kmeans, y))
plt.title('confusion matrix')
plt.xlabel('actual values')    # columns: the Wine column
plt.ylabel('Y_hat_kmeans')     # rows: KMeans cluster labels
plt.colorbar()


[[46  1  0]
 [ 0 50 19]
 [13 20 29]]
Out[124]:
<matplotlib.colorbar.Colorbar instance at 0x111724b00>
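
Because cluster numbering is arbitrary, a best-case accuracy requires aligning cluster labels to Wine labels first. One way is the Hungarian algorithm via scipy's linear_sum_assignment; a sketch:

In [ ]:
from scipy.optimize import linear_sum_assignment

# find the label permutation that maximizes the confusion-matrix diagonal,
# then report the fraction of matched samples
cm = confusion_matrix(y, Y_hat_kmeans)
rows, cols = linear_sum_assignment(-cm)
print(cm[rows, cols].sum() / float(len(y)))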
